A Probability Analysis for Frequent Itemset Mining Algorithms
Authors
Abstract
Since the introduction of the Frequent Itemset Mining (FIM) problem, several algorithms for solving it have been proposed and experimentally analyzed. Our work focuses on the theoretical analysis of FIM. The aim is to give a detailed probabilistic study of the performance of FIM algorithms for different data distributions. It is joint work with Dirk Van Gucht and Paul Purdom from Indiana University in Bloomington, USA; the inspiration can be found in important related work [4]. The research considers in detail the probabilities that a set is a candidate and a success, or a candidate and a failure, for a collection of well-known FIM algorithms. The Apriori Algorithm [2] is considered in detail; for AIS [1], Eclat [5], FP-growth [3] and the Fast Completion Apriori (FCA) Algorithm [2], the analysis is similar, so only the main principles are sketched. The probabilistic analysis is done for different data distributions, covered by a general shopping model in which all shoppers are independent and each combination of items has its own probability of being purchased, so any correlation between items is possible.
We focus on algorithms that are candidate-based. An itemset I is called a candidate when its frequency status cannot be deduced by the algorithm from previous knowledge about other itemsets, but has to be counted explicitly in the database. In practice, I is a candidate if certain associated testsets have already been determined to be frequent. For all the algorithms except FCA, the testsets are the itemsets obtained by omitting a single item from I; for FCA, the testsets are all those subsets of I whose size is equal to the level where the regular Apriori Algorithm was last used. The exact frequency status of I is determined by explicitly counting I's occurrences in the database. If I is frequent, it is called a success; otherwise, it is a failure.
The candidacy probability, the probability that an itemset is a candidate, depends on the particular algorithm. In contrast, all correct algorithms give the same probability that an itemset is frequent, the success probability; it is a property of the data, not of the algorithm. The probability that an itemset is a candidate but not frequent is called the failure probability; it depends on both the problem instance and the algorithm, and is particularly important because it is related to work that a better algorithm might hope to avoid.
Our research shows that for each algorithm, the candidacy probability of an itemset is usually determined almost entirely by the frequency probability of a particular testset, but which testset this is depends on the algorithm. For both versions of the Apriori Algorithm, this dominant testset is the one with the smallest probability. For the AIS algorithm, it is the testset with the highest probability. For Eclat-like algorithms, including FP-growth, it is a set that is at least as good as the testset with the second-highest probability, and there is a tendency for it to be that testset. We prove that the candidacy probability of an itemset is near 1 when the probability of this dominant testset is significantly above k/b (with k the user-defined support threshold from the FIM problem and b the number of baskets in the database), and near 0 when this probability is significantly below k/b. Similar results hold for the success and failure probabilities.
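To make the candidate/success/failure terminology concrete, the following is a minimal sketch, in Python, of how a candidate-based algorithm such as Apriori classifies itemsets level by level. It is an illustration under simple assumptions (baskets represented as Python sets, a hypothetical max_size parameter, and hypothetical function names), not the authors' implementation.

```python
from itertools import combinations

def support(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)

def apriori_classify(baskets, items, k, max_size):
    """Per level, split itemsets into candidates, successes and failures.

    An itemset is a candidate when every testset (the itemset with one item
    removed) was found frequent at the previous level; a candidate is a
    success if its support is at least k, and a failure otherwise.
    """
    report = {}
    frequent = set()  # frequent itemsets of the previous level
    for size in range(1, max_size + 1):
        if size == 1:
            # Every single item is a candidate at level 1.
            candidates = {frozenset([i]) for i in items}
        else:
            candidates = {
                frozenset(c)
                for c in combinations(items, size)
                if all(frozenset(c) - {i} in frequent for i in c)
            }
        successes = {c for c in candidates if support(c, baskets) >= k}
        report[size] = {
            "candidates": candidates,
            "successes": successes,
            "failures": candidates - successes,
        }
        frequent = successes
    return report
```

For FCA, the candidacy test would instead check the frequent subsets of the size at which the regular Apriori Algorithm was last run, rather than the immediate (size - 1)-subsets.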
We also show that the algorithms have similar performance on uniform data, whereas they can have hugely different performance on other types of data.
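The threshold behaviour around k/b can be illustrated with a small Monte Carlo sketch under a deliberately simplified version of the shopping model in which items are bought independently (the general model allows arbitrary correlations between items). The function names, parameters, and example numbers below are hypothetical choices for illustration only.

```python
import random

def random_database(b, item_probs):
    """b independent baskets; item j is bought with probability item_probs[j]."""
    return [{j for j, p in enumerate(item_probs) if random.random() < p}
            for _ in range(b)]

def is_apriori_candidate(itemset, baskets, k):
    """Apriori candidacy: every testset (itemset minus one item) has support >= k."""
    return all(
        sum(1 for basket in baskets if (itemset - {i}) <= basket) >= k
        for i in itemset
    )

def estimate_candidacy_probability(itemset, b, k, item_probs, trials=200):
    """Fraction of random databases in which `itemset` is an Apriori candidate."""
    hits = sum(
        is_apriori_candidate(itemset, random_database(b, item_probs), k)
        for _ in range(trials)
    )
    return hits / trials

# Example: three independent items, each bought with probability 0.3, so each
# 2-item testset of {0, 1, 2} occurs in a basket with probability 0.09.
# With b = 1000 baskets, k = 50 gives k/b = 0.05 (well below 0.09, candidacy
# probability near 1), while k = 150 gives k/b = 0.15 (well above 0.09,
# candidacy probability near 0).
# print(estimate_candidacy_probability(frozenset({0, 1, 2}), 1000, 50, [0.3] * 3))
# print(estimate_candidacy_probability(frozenset({0, 1, 2}), 1000, 150, [0.3] * 3))
```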
Similar References
A New Algorithm for High Average-utility Itemset Mining
High utility itemset mining (HUIM) is an emerging field in data mining which has gained growing interest due to its various applications. The goal of this problem is to discover all itemsets whose utility exceeds a minimum threshold. The basic HUIM problem does not consider the length of itemsets in its utility measurement, and utility values tend to become higher for itemsets containing more items...
A Review on Algorithms for Mining Frequent Itemset Over Data Stream
Frequent itemset mining over dynamic data is an important problem in the context of data mining. The two main factors of a data stream mining algorithm are memory usage and runtime, since both are limited resources. Mining frequent patterns in data streams, as in traditional databases and many other types of databases, has been widely studied in data mining research. Many applications like stock ...
Analysis of Association Rule Mining Algorithms to Generate Frequent Itemset
Association rule mining algorithms are used to extract relevant information from a database and present it in a simple form. Association rule mining is applied to large sets of data. It is used for mining frequent itemsets in a database or data warehouse, and is one type of data mining procedure. In this paper some association rule mining algorithms such as Apriori, Partition...
Analysis of Frequent Item set Mining on Variant Datasets
Association rule mining is the process of discovering relationships among the data items in a large database. It is one of the most important problems in the field of data mining. Finding frequent itemsets is one of the most computationally expensive tasks in association rule mining. The classical frequent itemset mining approaches mine the frequent itemsets from the database where the presence of an...
Probabilistic analysis of success and failure rates of candidates generation algorithms for the frequent itemsets mining problem
This paper considers the success and failure probabilities of candidate generation algorithms for the frequent itemset mining problem under several probability models. Results for one of the models had been obtained previously, but with a complex derivation. Our re-derivation of these results is simpler and employs a concentration inequality for the sum of independent Bernoulli random variables. Ou...
Publication date: 2005